STA6235: Modeling in Regression
Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
Until now, we have discussed continuous predictors.
Now we will introduce the use of categorical, or qualitative, predictors.
This means that we will include predictors that categorize the observations.
If a categorical predictor has c classes, we include c-1 indicator variables in the model.
e.g., underclassman status: freshman, sophomore, junior, senior
e.g., AKC-recognized pug color: fawn, black
We will create indicator variables to include in our model.
The c-1 predictors included in the model will be binary indicators for category. x_i = \begin{cases} 1 & \textnormal{if category $i$} \\ 0 & \textnormal{if another category} \end{cases}
In the underclassman status example, we could create four indicator variables, but we would only include three in our model.
While we will call them indicator variables, you will more often see them called dummy variables.
We could use mutate() and if_else() to define indicator variables.
However, we will use the dummy_cols() function from the fastDummies package.
Note that this will also create a column that indicates if the value is missing or not.
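As a minimal sketch (with hypothetical data) of what these indicator variables look like, we can build them by hand in base R; dummy_cols() automates this construction for a whole data frame.

```r
# Hypothetical data: underclassman status for five students
status <- c("freshman", "sophomore", "junior", "senior", "sophomore")

# One indicator per category (we would drop one before modeling):
x_soph <- ifelse(status == "sophomore", 1, 0)
x_jun  <- ifelse(status == "junior", 1, 0)
x_sen  <- ifelse(status == "senior", 1, 0)
# "freshman" has no indicator, so it acts as the reference group.

x_soph  # 0 1 0 0 1
```

dummy_cols() produces the same kind of 0/1 columns, one per observed category, with names built from the original column name and category label.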
We represent a categorical variable with c classes with c-1 indicator variables in a model.
The category whose indicator variable is not included is called the reference group.
How do we choose a reference group?
It depends on the story being told / what is of interest.
It does not affect the usefulness of the model, only the interpretations.
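As an aside, if the categorical variable is stored as a factor, base R's relevel() lets us pick the reference group directly; a small sketch with hypothetical data:

```r
# Hypothetical data: AKC-recognized pug colors
color <- factor(c("fawn", "black", "fawn"))
levels(color)    # "black" "fawn" -- the first level is the default reference

# Make "fawn" the reference group instead:
color2 <- relevel(color, ref = "fawn")
levels(color2)   # "fawn" "black"
```

Refitting the model with the releveled factor changes only the interpretations of the coefficients, not the fit.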
Let’s model the egg production data.
Suppose we are interested in modeling the number of eggs as a function of the number of hens, the production type, and the production process.
First, let’s see what happens when we do not use indicator variables.
Call:
lm(formula = n_eggs ~ n_hens + prod_type + prod_process, data = egg_prod)
Residuals:
Min 1Q Median 3Q Max
-784734808 -13132405 3119085 28065097 367227439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.628e+08 6.149e+07 -5.900 1.4e-08 ***
n_hens 2.487e+01 9.605e-01 25.895 < 2e-16 ***
prod_typetable eggs 2.213e+08 2.521e+08 0.878 0.381
prod_processcage-free (non-organic) 7.137e+07 2.694e+08 0.265 0.791
prod_processcage-free (organic) 1.152e+08 2.963e+08 0.389 0.698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 124800000 on 215 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.336e+04 on 4 and 215 DF, p-value: < 2.2e-16
(Intercept) n_hens
-3.627694e+08 2.487230e+01
prod_typetable eggs prod_processcage-free (non-organic)
2.212627e+08 7.137026e+07
prod_processcage-free (organic)
1.152205e+08
R uses the “first group” as the reference group.
Production Type \to “hatching eggs”
Production Process \to “all”
It is fine to rely on this default for “ease” - if I am just checking significance, I do not necessarily need to construct indicator variables myself.
What if they were stored as numeric instead?
Call:
lm(formula = n_eggs ~ n_hens + type_num + proc_num, data = egg_prod)
Residuals:
Min 1Q Median 3Q Max
-784886167 -12799691 3302015 26977353 367523740
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.467e+08 3.495e+07 -18.502 < 2e-16 ***
n_hens 2.477e+01 1.613e-01 153.546 < 2e-16 ***
type_num 2.492e+08 4.206e+07 5.924 1.22e-08 ***
proc_num 4.125e+07 2.749e+07 1.500 0.135
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 124500000 on 216 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 4.468e+04 on 3 and 216 DF, p-value: < 2.2e-16
Treating categories stored as numbers like continuous predictors forces a single slope across the categories, which is not what we want. We can tell R to treat them as categorical using as.factor().
Call:
lm(formula = n_eggs ~ n_hens + as.factor(type_num) + as.factor(proc_num),
data = egg_prod)
Residuals:
Min 1Q Median 3Q Max
-784734808 -13132405 3119085 28065097 367227439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.628e+08 6.149e+07 -5.900 1.4e-08 ***
n_hens 2.487e+01 9.605e-01 25.895 < 2e-16 ***
as.factor(type_num)2 2.213e+08 2.521e+08 0.878 0.381
as.factor(proc_num)2 7.137e+07 2.694e+08 0.265 0.791
as.factor(proc_num)3 1.152e+08 2.963e+08 0.389 0.698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 124800000 on 215 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.336e+04 on 4 and 215 DF, p-value: < 2.2e-16
Circling back to indicator variables, let’s construct our model using those.
First, we need to see what their names are.
Note the spaces!
m4 <- lm(n_eggs ~ n_hens + `prod_type_hatching eggs` + `prod_process_cage-free (non-organic)` + `prod_process_cage-free (organic)`, data = egg_prod)
summary(m4)
Call:
lm(formula = n_eggs ~ n_hens + `prod_type_hatching eggs` + `prod_process_cage-free (non-organic)` +
`prod_process_cage-free (organic)`, data = egg_prod)
Residuals:
Min 1Q Median 3Q Max
-784734808 -13132405 3119085 28065097 367227439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.415e+08 3.106e+08 -0.456 0.649
n_hens 2.487e+01 9.605e-01 25.895 <2e-16
`prod_type_hatching eggs` -2.213e+08 2.521e+08 -0.878 0.381
`prod_process_cage-free (non-organic)` 7.137e+07 2.694e+08 0.265 0.791
`prod_process_cage-free (organic)` 1.152e+08 2.963e+08 0.389 0.698
(Intercept)
n_hens ***
`prod_type_hatching eggs`
`prod_process_cage-free (non-organic)`
`prod_process_cage-free (organic)`
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 124800000 on 215 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.336e+04 on 4 and 215 DF, p-value: < 2.2e-16
Note that using indicator variables allows us to easily change the reference group.
Let’s change the reference groups for both categorical predictors.
m5 <- lm(n_eggs ~ n_hens + `prod_type_table eggs` + `prod_process_cage-free (non-organic)` + `prod_process_all`, data = egg_prod)
summary(m5)
Call:
lm(formula = n_eggs ~ n_hens + `prod_type_table eggs` + `prod_process_cage-free (non-organic)` +
prod_process_all, data = egg_prod)
Residuals:
Min 1Q Median 3Q Max
-784734808 -13132405 3119085 28065097 367227439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.475e+08 2.380e+08 -1.040 0.299
n_hens 2.487e+01 9.605e-01 25.895 <2e-16
`prod_type_table eggs` 2.213e+08 2.521e+08 0.878 0.381
`prod_process_cage-free (non-organic)` -4.385e+07 3.598e+07 -1.219 0.224
prod_process_all -1.152e+08 2.963e+08 -0.389 0.698
(Intercept)
n_hens ***
`prod_type_table eggs`
`prod_process_cage-free (non-organic)`
prod_process_all
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 124800000 on 215 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.336e+04 on 4 and 215 DF, p-value: < 2.2e-16
(Intercept) n_hens
-1.415066e+08 2.487230e+01
`prod_type_hatching eggs` `prod_process_cage-free (non-organic)`
-2.212627e+08 7.137026e+07
`prod_process_cage-free (organic)`
1.152205e+08
(Intercept) n_hens
-2.475489e+08 2.487230e+01
`prod_type_table eggs` `prod_process_cage-free (non-organic)`
2.212627e+08 -4.385025e+07
prod_process_all
-1.152205e+08
`prod_type_hatching eggs`
-221262740
`prod_type_table eggs`
221262740
c(c4["`prod_process_cage-free (non-organic)`"], c4["`prod_process_cage-free (organic)`"]) # ref: all
`prod_process_cage-free (non-organic)` `prod_process_cage-free (organic)`
71370261 115220506
`prod_process_cage-free (non-organic)` prod_process_all
-43850245 -115220506
Organic vs. all (from m4) is the same relationship as all vs. organic (from m5).
The coefficients for the non-organic process differ because they are measured against different reference groups.
We are interested in knowing if the overall categorical variable is a significant predictor of the outcome.
If c = 2, we can use the results from summary().
If c \ge 3, we must use a partial F.
We will define the full model as the model with all predictors.
We will define the reduced model as the model without predictors relevant to the categorical variable.
Then, we will use anova() to compare the reduced and full models.
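As a sketch of the mechanics with simulated data (not the egg data), anova() applied to a reduced and a full lm fit carries out the partial F test:

```r
# Simulated data: one continuous predictor and a 3-level categorical predictor
set.seed(1)
n   <- 60
grp <- factor(rep(c("a", "b", "c"), each = n / 3))  # c = 3 classes
x   <- rnorm(n)
y   <- 1 + 2 * x + rnorm(n)

full    <- lm(y ~ x + grp)  # includes c - 1 = 2 indicators for grp
reduced <- lm(y ~ x)        # drops both grp indicators

anova(reduced, full)        # partial F test of H0: both grp coefficients are 0
```

The F statistic has numerator degrees of freedom equal to the number of indicator variables dropped (here, 2).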
In theory:
Hypotheses
H_0: \beta_{1}^* = \beta_{2}^* = ... = \beta_{s}^* = 0 in the model y = \beta_0 + \beta_1 x_1 + ... + \beta_q x_q + \beta_{1}^* x_1^* + \beta_{2}^* x_2^* + ... + \beta_{s}^* x_s^* + \varepsilon
H_1: at least one \beta_i^* \ne 0 in the model y = \beta_0 + \beta_1 x_1 + ... + \beta_q x_q + \beta_{1}^* x_1^* + \beta_{2}^* x_2^* + ... + \beta_{s}^* x_s^* + \varepsilon
Test Statistic and p-Value
F_0 = \frac{\left[ \text{SSReg(full)} - \text{SSReg(reduced)} \right]/s}{\text{MSE(full)}}
p = P(F_{s, n-q-s-1} \ge F_0)
Rejection Region
In practice:
Hypotheses
H_0: \beta_{1} = \beta_{2} = ... = \beta_{c-1} = 0 in the specified model
H_1: at least one \beta_i \ne 0 in the specified model
Test Statistic and p-Value
We obtain the test statistic and p-value from anova().
Rejection Region
Hypotheses
H_0: \beta_{\text{non-organic}} = \beta_{\text{organic}} = 0 in the specified model
H_1: at least one \beta_i \ne 0 in the specified model
Test Statistic and p-Value
F_0 = 1.127
p = 0.326
Rejection Region
Conclusion/Interpretation
Fail to reject H_0.
There is not sufficient evidence to suggest that the process type is a significant predictor of egg production.
Because production type has c = 2 classes, we can use the t test from summary().
Call:
lm(formula = n_eggs ~ n_hens + `prod_type_hatching eggs` + `prod_process_cage-free (non-organic)` +
`prod_process_cage-free (organic)`, data = egg_prod)
Residuals:
Min 1Q Median 3Q Max
-784734808 -13132405 3119085 28065097 367227439
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.415e+08 3.106e+08 -0.456 0.649
n_hens 2.487e+01 9.605e-01 25.895 <2e-16
`prod_type_hatching eggs` -2.213e+08 2.521e+08 -0.878 0.381
`prod_process_cage-free (non-organic)` 7.137e+07 2.694e+08 0.265 0.791
`prod_process_cage-free (organic)` 1.152e+08 2.963e+08 0.389 0.698
(Intercept)
n_hens ***
`prod_type_hatching eggs`
`prod_process_cage-free (non-organic)`
`prod_process_cage-free (organic)`
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 124800000 on 215 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 3.336e+04 on 4 and 215 DF, p-value: < 2.2e-16
Hypotheses
H_0: \beta_{\text{egg type}} = 0
H_1: \beta_{\text{egg type}} \ne 0
Test Statistic and p-Value
t_0 = -0.878
p = 0.381
Rejection Region
Conclusion/Interpretation
Fail to reject H_0.
There is not sufficient evidence to suggest that the egg type is a significant predictor of number of eggs.
Including categorical predictors in the model shifts the y-intercept.
We will plug in 0’s and 1’s to the terms representing the categorical predictor.
Let’s define predicted values for the different production processes.
(Intercept) n_hens
-1.415066e+08 2.487230e+01
`prod_type_hatching eggs` `prod_process_cage-free (non-organic)`
-2.212627e+08 7.137026e+07
`prod_process_cage-free (organic)`
1.152205e+08
egg_prod <- egg_prod %>%
mutate(y_hat_all = c4["(Intercept)"] + c4["n_hens"]*n_hens + c4["`prod_type_hatching eggs`"]*1 + c4["`prod_process_cage-free (non-organic)`"]*0 + c4["`prod_process_cage-free (organic)`"]*0,
y_hat_org = c4["(Intercept)"] + c4["n_hens"]*n_hens + c4["`prod_type_hatching eggs`"]*1 + c4["`prod_process_cage-free (non-organic)`"]*0 + c4["`prod_process_cage-free (organic)`"]*1,
y_hat_non_org = c4["(Intercept)"] + c4["n_hens"]*n_hens + c4["`prod_type_hatching eggs`"]*1 + c4["`prod_process_cage-free (non-organic)`"]*1 + c4["`prod_process_cage-free (organic)`"]*0)

egg_prod %>% ggplot(aes(x = n_hens, y = n_eggs, color = prod_process)) +
geom_point() +
geom_line(aes(y = y_hat_all), color = "blue") +
geom_line(aes(y = y_hat_org), color = "hotpink") +
geom_line(aes(y = y_hat_non_org), color = "purple") +
xlim(0, 75000000) + ylim(0, 1750000000) +
labs(x = "Number of Hens",
y = "Number of Eggs",
color = "Production Process") +
theme_bw()

Today we have introduced the concept of categorical predictors.
Always remember that if there are c categories, there will be c-1 predictor terms in the model.
When c = 2, we can use the test results from summary() to discuss significance.
When c \ge 3, we must use a partial F test (i.e., full/reduced ANOVA).
When including categorical predictors in visualizations, we must plug in 0’s and 1’s - no means, medians, etc.! We plug in only plausible values.
Today’s assignment:
Introduce species as a categorical predictor of penguin body mass.
Determine significance of predictors.
Reassess model assumptions under the new model.